Code
library(tidyverse)
library(here)
library(gganimate)library(tidyverse)
library(here)
library(gganimate)Both of our data sets have been sourced from the GapMinder website.
For our first data set we chose one that focuses on the GDP ( Gross Domestic Product) per capita of almost 200 countries across the world. GDP per capita takes the whole economic output of a country, and divides it by the country’s population to understand the economic output per person in a country. Lower GDPs per capita means the economy of the given country is poor. The data we have chose highlights the change in GDP per capita from 1950 to 2025 throughout the given countries.
Our second data is centered around Child Mortality rates in countries across the world. The rate of child mortality is measured as the amount of deaths to children 5 years or younger for every 1,000 children born in a country. This data set includes almost 200 countries from around the world and gives us their child mortality rates from 1950 - 2025.
gdp_data <- read_csv(here::here("gdp_pcap.csv"))
child_mortality_data <- read_csv(here::here("child_mortality_0_5_year_olds_dying_per_1000_born.csv"))child_mortality_long <- child_mortality_data |>
select(country, `1950`:`2025`) |>
pivot_longer(cols = `1950`:`2025`, names_to = "Year", values_to = "Death_per_1000") |>
filter(!is.na(Death_per_1000)) gdp_data_long <- gdp_data |>
select(country, `1950`:`2025`) |>
pivot_longer(cols = `1950`:`2025`, names_to = "Year", values_to = "GDP") |>
filter(!is.na(GDP)) |>
mutate(GDP = if_else(str_detect(GDP, "k"), as.numeric(str_replace(GDP, "k", "")) * 1000, as.numeric(GDP)))We are combining our two data sets, GDP per capita and Child Mortality Rates, to examine what type of relationship, if any, there is between the two variables. We are testing how GDP per capita has affected Child Mortality Rates in countries across the world from 1950 to 2025, and if we can find any association between the two. Our hypothesis is that a lower GDP per Capita will be positively associated with a higher Child Mortality Rate. We believe this to be true because the lower economic standing a country has, the harsher living conditions there tends to be meaning an increased possibility of child mortality.
Total_data <- child_mortality_long |>
inner_join(gdp_data_long, join_by("country", "Year")) |>
rename("Country" = country)The statistical method we’re using for this is linear regression. Linear regression is a statistical method that predicts the linear relationship between a quantitative response variable and one or more quantitiative explanatory variables. Specifically in this report, we are using a simple linear regression model, which takes in one explanatory variable, that predicts the linear relationship between our explanatory variable, GDP, and the response variable, Child Mortality rates.
=======The statistical method we’re using for this is linear regression. Linear regression is a statistical method that predicts the linear relationship between a quantitative response variable and one or more quantitative explanatory variables. In this report, we are using a simple linear regression model which takes in one explanatory variable, that predicts the linear relationship between our explanatory variable, GDP, and the response variable, Child Mortality rates.
>>>>>>> 3b7ab264dbf0d0ff47edef107ab3252154d84e15avg_Total_data <- Total_data|>
group_by(Country)|>
summarize(avg_GDP = mean(GDP), avg_Death_per_1000 = mean(Death_per_1000))From the animate plots we are able to see the relationship between Time and both GDP and Child Mortality. Throughout the 75 year period we chose, 1950 to 2025, we can see generally that GDP has increased exponentially over time and Child Mortality has decreased significantly. We can see there is an inverse association between the two variables as when GDP goes up, Child Mortality goes down. Through the animation we can see dots moving in a diagonal direction indicating the possibility of a linear relationship between the two variables.
bubble <- ggplot(Total_data, aes(x = GDP, y = Death_per_1000, size = GDP)) +
geom_point() +
theme_bw() +
labs(title = 'Year: 1950-2025', x = 'GDP', y = NULL, subtitle = "Child Mortality") +
transition_time(as.numeric(Year)) +
ease_aes('linear')
gganimate::transition_states(
Year,
transition_length = 1,
state_length = 1
) bubble <- ggplot(Total_data, aes(x = GDP, y = Death_per_1000, size = GDP)) +
geom_point() +
theme_bw() +
labs(title = 'Year: 1950-2025', x = 'GDP', y = 'Child Mortality') +
transition_time(as.numeric(Year)) +
ease_aes('linear')
gganimate::transition_states(
Year,
transition_length = 1,
state_length = 1
) <ggproto object: Class TransitionStates, Transition, gg>
adjust_nframes: function
expand_data: function
expand_layer: function
expand_panel: function
finish_data: function
get_all_row_vars: function
get_frame_data: function
get_frame_vars: function
get_row_vars: function
map_data: function
mapping: (.*)
params: list
remap_frames: function
require_late_tween: function
setup_params: function
setup_params2: function
static_layers: function
unmap_frames: function
var_names: states
super: <ggproto object: Class TransitionStates, Transition, gg>
animate(bubble, renderer = gifski_renderer())ggplot(avg_Total_data, aes(x = avg_GDP, y = avg_Death_per_1000)) +
geom_point() +
labs(x = "Average GDP", y = NULL, subtitle = "Average Child Mortality per 1000 Births", title = "Relationship between Average Child Mortality and Average GDP over time")\[\hat{y} = 123 - .0024x, \text{where}\] \[\hat{y} = \text{Predicted Average Child Mortality Rate} \text{ and } x = \text{Average GDP}\]
From this equation we can gather that when the Average GDP of a country is $0, the predicted Average Child Mortality Rate is 123 deaths per 1000 children. We can also see that with every one dollar increase in Average GDP, the predicted Average Child Mortality rate goes down by .0024 deaths per 1000 children.
linear_regression_model <- lm(avg_Death_per_1000 ~ avg_GDP, data = avg_Total_data)
summary(linear_regression_model)
Call:
lm(formula = avg_Death_per_1000 ~ avg_GDP, data = avg_Total_data)
Residuals:
Min 1Q Median 3Q Max
-85.55 -44.28 -18.56 37.85 149.78
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.230e+02 5.071e+00 24.262 <2e-16 ***
avg_GDP -2.372e-03 2.405e-04 -9.866 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 53.6 on 193 degrees of freedom
Multiple R-squared: 0.3353, Adjusted R-squared: 0.3318
F-statistic: 97.34 on 1 and 193 DF, p-value: < 2.2e-16
From the linear regression model above, we can estimate that about 33.53% of the variability in the response values are explained by the regression model. This suggests that we have a weak to moderate model for explaining the variability in the data. We can see that the fitted values, which represent the variability of Average Child Mortality rates accounted for by the Average GDP, makes up only 1441.605 of 4299.856 of the response values, which is about .3353, the same as the R-Square value. The residuals, 2858.251 out of the 4299.856 response values, is the rest of the unexplained variability in the data set that isn’t represented in the model.
variance_response <- var(avg_Total_data$avg_Death_per_1000)
variance_fitted <- var(fitted(linear_regression_model))
variance_residuals <- var(residuals(linear_regression_model))
formatted_table <- data.frame(Variance = c(variance_response, variance_fitted, variance_residuals),
Source = c("Response Values", "Fitted Values", "Residuals"))
print(formatted_table) Variance Source
1 4299.856 Response Values
2 1441.605 Fitted Values
3 2858.251 Residuals
predictions <- predict(linear_regression_model, avg_Total_data)
simulated_values <- predictions + rnorm(length(predictions), 0, sigma(linear_regression_model))
simulated_formatted_table <- data.frame(avg_GDP = avg_Total_data$avg_GDP, avg_Death_per_1000 = simulated_values)
ggplot() +
geom_point(data = avg_Total_data, aes(x = avg_GDP, y = avg_Death_per_1000), color = "red", alpha = 0.5) +
geom_point(data = simulated_formatted_table, aes(x = avg_GDP, y = avg_Death_per_1000), color = "blue", alpha = 0.5) +
labs(x= "Average GDP", y = NULL, subtitle ="Average Child Mortality per 1000", title = "Comparison of Observed and Simulated Data")set.seed(9531)
num_simulations <- 1000
r_squared_values <- numeric(num_simulations)
for (i in 1:num_simulations) {
simulated_values <- predictions + rnorm(length(predictions), 0, sigma(linear_regression_model))
simulated_formatted_table <- data.frame(avg_GDP = avg_Total_data$avg_GDP, avg_Death_per_1000 = simulated_values)
simulated_linear_regression_model <- lm(simulated_formatted_table[,1] ~ simulated_formatted_table[,2], data = formatted_table)
r_squared_values[i] <- summary(simulated_linear_regression_model)$r.squared
}
ggplot(mapping = aes(x = r_squared_values)
) +
geom_histogram(binwidth = .02, color = "black", fill = "blue") +
labs(x = "R-Squared",
y = NULL,
subtitle = "Frequency",
title = "R-Squared Distribution"
)